FAST MONOTONE SUMMATION OVER DISJOINT SETS* 



PETTERI KASKI 1 , MIKKO KOIVISTO 2 , AND JANNE H. KORHONEN 2 

Abstract. We study the problem of computing an ensemble of multiple 
sums where the summands in each sum are indexed by subsets of size p 
of an n-element ground set. More precisely, the task is to compute, for 
each subset of size q of the ground set, the sum over the values of all 
subsets of size p that are disjoint from the subset of size q. We present 
an arithmetic circuit that, without subtraction, solves the problem using 
O((n^p + n^q) log n) arithmetic gates, all monotone; for constant p, q this
is within the factor log n of the optimal. The circuit design is based
on viewing the summation as a "set nucleation" task and using a tree- 
projection approach to implement the nucleation. Applications include 
improved algorithms for counting heaviest k-paths in a weighted graph,
computing permanents of rectangular matrices, and dynamic feature 
selection in machine learning. 



1. Introduction 

1.1. Weak algebrisation. Many hard combinatorial problems benefit from 
algebrisation, where the problem to be solved is cast in algebraic terms as 
the task of evaluating a particular expression or function over a suitably 
rich algebraic structure, such as a multivariate polynomial ring over a finite 
field. Recent advances in this direction include improved algorithms for 
the k-path [2BJ, Hamiltonian path [4], k-coloring [9], Tutte polynomial [BJ,
knapsack [22], and connectivity [H] problems. A key ingredient in all of these 
advances is the exploitation of an algebraic catalyst, such as the existence of 
additive inverses for inclusion-exclusion, or the existence of roots of unity 
for evaluation/interpolation, to obtain fast evaluation algorithms. 

Such advances notwithstanding, it is a basic question whether the catalyst
is necessary to obtain speedup. For example, fast algorithms for matrix 
multiplication |11[ [T3] (and combinatorially related tasks such as finding a 
triangle in a graph [U [T7] ) rely on the assumption that the scalars have a 
ring structure, which prompts the question whether a weaker structure, such 
as a semiring without additive inverses, would still enable fast multiplication. 
The answer to this particular question is known to be negative [E], but for 
many of the recent advances such an analysis has not been carried out. In
particular, many of the recent algebrisations have significant combinatorial 
structure, which gives hope for positive results even if algebraic catalysts are 
lacking. The objective of this paper is to present one such positive result by 
deploying combinatorial tools. 

1.2. A lemma of Valiant. Our present study stems from a technical lemma 
of Valiant [23] encountered in the study of circuit complexity over a monotone
versus a universal basis. More specifically, starting from n variables
f_1, f_2, ..., f_n the objective is to use as few arithmetic operations as possible
to compute the n sums of variables where the jth sum e_j includes all the
other variables except the variable f_j, where j = 1, 2, ..., n.

If additive inverses are available, a solution using O(n) arithmetic operations
is immediate: first take the sum of all the n variables, and then for
j = 1, 2, ..., n compute e_j by subtracting the variable f_j.

Valiant [23] showed that O(n) operations suffice also when additive inverses
are not available; we display Valiant's elegant combinatorial solution for n = 8
below as an arithmetic circuit. See Appendix A for the general case.

1.3. Generalising to higher dimensions. This paper generalises Valiant's
lemma to higher dimensions using purely combinatorial tools. Accordingly, 
we assume that only very limited algebraic structure is available in the form 
of a commutative semigroup (S, ⊕). That is, ⊕ satisfies the associative law
x ⊕ (y ⊕ z) = (x ⊕ y) ⊕ z and the commutative law x ⊕ y = y ⊕ x for all
x, y, z ∈ S, but nothing else is assumed.

By "higher dimensions" we refer to the input not consisting of n values
("variables" in the example above) in S, but rather $\binom{n}{p}$ values f(X) ∈ S
indexed by the p-subsets X of [n] = {1, 2, ..., n}. Accordingly, we also allow
the output to have higher dimension. That is, given as input a function
f from the p-subsets of [n] to the set S, the task is to output the function e
defined for each q-subset Y of [n] by

(1)    e(Y) = \bigoplus_{X : X \cap Y = \emptyset} f(X),

where the sum is over all p-subsets X of [n] satisfying the intersection
constraint. Let us call this problem (p, q)-disjoint summation.
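
As a point of reference, definition (1) can be evaluated directly by brute force. The following is a minimal sketch, not the circuit of this paper: the function name `disjoint_summation`, the dictionary layout of `f`, and the example values are all assumptions made for illustration. Because a semigroup need not have an identity element, the accumulator starts out empty rather than at a "zero".

```python
from itertools import combinations

def disjoint_summation(n, p, q, f, op):
    """Brute-force (p, q)-disjoint summation over the ground set {1, ..., n}.

    f  : dict mapping each p-subset X (a frozenset) to a semigroup element
    op : the associative, commutative semigroup operation
    Returns a dict e with e[Y] = op-sum of f[X] over all p-subsets X with
    X and Y disjoint (None if no such X exists).
    """
    ground = range(1, n + 1)
    e = {}
    for Y in map(frozenset, combinations(ground, q)):
        total = None  # no identity element is assumed in the semigroup
        for X in map(frozenset, combinations(ground, p)):
            if X.isdisjoint(Y):
                total = f[X] if total is None else op(total, f[X])
        e[Y] = total
    return e

# Hypothetical example: n = 4, p = q = 1, semigroup (integers, max).
f = {frozenset({i}): 10 * i for i in range(1, 5)}
e = disjoint_summation(4, 1, 1, f, max)
```

This direct evaluation applies the operation roughly once per pair (X, Y), which is the cost that the circuit construction is designed to avoid.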

In analogy with Valiant's solution for the case p = q = 1 depicted above, 
an algorithm that solves the (p, q)-disjoint summation problem can now be
viewed as a circuit consisting of two types of gates: input gates indexed by
p-subsets X and arithmetic gates that perform the operation ⊕, with certain
arithmetic gates designated as output gates indexed by q-subsets Y. We
would like a circuit that has as few gates as possible. In particular, does
there exist a circuit whose size for constant p, q is within a logarithmic factor
of the lower bound Ω(n^p + n^q)?

1.4. Main result. In this paper we answer the question in the affirmative. 
Specifically, we show that a circuit of size O((n^p + n^q) log n) exists to compute
e from f over an arbitrary commutative semigroup (S, ⊕), and moreover,
there is an algorithm that constructs the circuit in time O((p^2 + q^2)(n^p +
n^q) log^3 n). These bounds hold uniformly for all p, q. That is, the coefficient
hidden by O-notation does not depend on p and q. 

From a technical perspective our main contribution is combinatorial and 
can be expressed as a solution to a specific set nucleation task. In such a task 
we start with a collection of "atomic compounds" (a collection of singleton 
sets), and the goal is to assemble a specified collection of "target compounds" 
(a collection of sets that are unions of the singletons). The assembly is to be 
executed by a straight-line program, where each operation in the program 
selects two disjoint sets in the collection and inserts their union into the 
collection. (Once a set is in the collection, it may be selected arbitrarily 
many times.) The assembly should be done in as few operations as possible. 

Our main contribution can be viewed as a straight-line program of length
O((n^p + n^q) log n) that assembles the collection {{X : X ∩ Y = ∅} : Y}
starting from the collection {{X} : X}, where X ranges over the p-subsets of
[n] and Y ranges over the q-subsets of [n]. Valiant's lemma [23] in these terms
provides an optimal solution of length O(n) for the specific case p = q = 1.
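
To make the set nucleation task concrete, the sketch below executes a straight-line program of disjoint unions and checks that each step is legal. The function `run_nucleation` and the hand-written six-step program for n = 4 are illustrative assumptions; for p = q = 1 the atoms are identified with the elements of [n], and the targets are the complements [n] \ {y}.

```python
def run_nucleation(atoms, program):
    """Execute a straight-line program of disjoint unions.

    atoms   : iterable of hashable items; the collection starts as their singletons
    program : list of pairs (S, T) of frozensets that must already be in the collection
    Returns the final collection; raises ValueError on an illegal step.
    """
    collection = {frozenset({a}) for a in atoms}
    for S, T in program:
        if S not in collection or T not in collection:
            raise ValueError("operand has not been assembled yet")
        if not S.isdisjoint(T):
            raise ValueError("operands must be disjoint")
        collection.add(S | T)
    return collection

fs = frozenset
# n = 4: assemble the four complements {1, 2, 3, 4} \ {y} with six unions.
program = [
    (fs({1}), fs({2})), (fs({3}), fs({4})),        # the two halves
    (fs({3, 4}), fs({2})), (fs({3, 4}), fs({1})),  # complements of 1 and of 2
    (fs({1, 2}), fs({4})), (fs({1, 2}), fs({3})),  # complements of 3 and of 4
]
result = run_nucleation([1, 2, 3, 4], program)
```

The program length here is 6 = 3n − 6, which agrees with the gate count derived for Valiant's construction in Appendix A.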

1.5. Applications. Many classical optimisation problems and counting 
problems can be algebrised over a commutative semigroup. A selection 
of applications will be reviewed in Sect. 3.

1.6. Related work. "Nucleation" is implicit in the design of many fast
algebraic algorithms; perhaps the two most central are the fast Fourier
transform of Cooley and Tukey [12] (as is witnessed by the butterfly circuit
representation) and Yates's 1937 algorithm [27] for computing the product
of a vector with the tensor product of n matrices of size 2 × 2. The latter
can in fact be directly used to obtain a nucleation process for (p, q)-disjoint
summation, even if an inefficient one. (For an exposition of Yates's method we
recommend Knuth [20, §4.6.4]; take m_i = 2 and g_i(s_i, t_i) = [s_i = 0 or t_i = 0]
for i = 1, 2, ..., n to extract the following nucleation process implicit in the
algorithm.) For all Z ⊆ [n] and i ∈ {0, 1, ..., n}, let

(2)    a_i(Z) = \{ X \subseteq [n] : X \cap [n-i] = Z \cap [n-i],\ X \cap Z \setminus [n-i] = \emptyset \}.

Put otherwise, a_i(Z) consists of X that agree with Z in the first n − i elements
of [n] and are disjoint from Z in the last i elements of [n]. In particular, our
objective is to assemble the sets a_n(Y) = {X : X ∩ Y = ∅} for each Y ⊆ [n]
starting from the singletons a_0(X) = {X} for each X ⊆ [n]. The nucleation
process given by Yates's algorithm is, for all i = 1, 2, ..., n and Z ⊆ [n], to set



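(3)    a_i(Z) = \begin{cases} a_{i-1}(Z \setminus \{n+1-i\}) & \text{if } n+1-i \in Z, \\ a_{i-1}(Z \cup \{n+1-i\}) \cup a_{i-1}(Z) & \text{if } n+1-i \notin Z. \end{cases}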

Figure 1. Representing {0,1}-strings of length at most b
as nodes in a perfect binary tree of height b. Here b = 4.
(a) Each string traces a unique path down from the root node,
with the empty string ε corresponding to the root node. The
nodes at level 0 ≤ ℓ ≤ b correspond to the strings of length
ℓ. The red leaf node corresponds to 0110 and the blue node
corresponds to 101. (b) A set of strings corresponds to a set
of nodes in the tree. The set X is displayed in red, the set
W in blue. The set W is the projection of the set X to level
ℓ = 2. Equivalently, X|_ℓ = W.

This results in 2^{n-1} n disjoint unions. If we restrict to the case |Y| ≤ q
and |X| ≤ p, then it suffices to consider only Z with |Z| ≤ p + q, which
results in $O((p + q) \sum_{j=0}^{p+q} \binom{n}{j})$ disjoint unions. Compared with our main
result, this is not particularly efficient. In particular, our main result relies
on "tree-projection" partitioning that enables a significant speedup over the
"prefix-suffix" partitioning in (2) and (3).

We observe that "set nucleation" can also be viewed as a computational 
problem, where the output collection is given and the task is to decide 
whether there is a straight-line program of length at most ℓ that assembles
the output using (disjoint) unions starting from singleton sets. This problem 
is known to be NP-complete even in the case where output sets have size 3 [Tp) 
Problem P09]; moreover, the problem remains NP-complete if the unions 
are not required to be disjoint. 

2. A Circuit for (p, q)-Disjoint Summation

2.1. Nucleation of p-subsets with a perfect binary tree. Looking at 
Valiant's circuit construction in the introduction, we observe that the left 
half of the circuit accumulates sums of variables (i.e., sums of 1-subsets of 
[n]) along what is a perfect binary tree. Our first objective is to develop 
a sufficient generalisation of this strategy to cover the setting where each 
summand is indexed by a p-subset of [n] with p > 1. 

Let us assume that n = 2^b for a nonnegative integer b so that we can
identify the elements of [n] with binary strings of length b. We can view each
binary string of length b as traversing a unique path starting from the root
node of a perfect binary tree of height b and ending at a unique leaf node.
Similarly, we may identify any node at level ℓ of the tree by a binary string
of length ℓ, with 0 ≤ ℓ ≤ b. See Fig. 1(a) for an illustration. For p = 1 this
correspondence suffices. 

For p > 1, we are not studying individual binary strings of length b (that is, 
individual elements of [n]), but rather p-subsets of such strings. In particular, 
we can identify each p-subset of [n] with a p-subset of leaf nodes in the binary 
tree. To nucleate such subsets it will be useful to be able to "project" sets
upward in the tree. This motivates the following definitions. 

Let us write {0,1}^ℓ for the set of all binary strings of length 0 ≤ ℓ ≤ b.
For ℓ = 0, we write ε for the empty string. For a subset X ⊆ {0,1}^b, we
define the projection of X to level ℓ as

(4)    X|_\ell = \{ x \in \{0,1\}^\ell : \exists y \in \{0,1\}^{b-\ell} \text{ such that } xy \in X \}.

That is, X|_ℓ is the set of length-ℓ prefixes of strings in X. Equivalently, in
the binary tree we obtain X|_ℓ by lifting each element of X to its ancestor on
level ℓ in the tree. See Fig. 1(b) for an illustration. For the empty set we
define ∅|_ℓ = ∅.
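
Definition (4) amounts to taking prefixes, as the following small sketch shows. It assumes that binary strings are represented as Python `str` values over the characters "0" and "1"; the function name `project` and the example set are illustrative and not taken from Figure 1.

```python
def project(X, level):
    """Return X|_level: the set of length-`level` prefixes of the strings in X."""
    return frozenset(x[:level] for x in X)

# Hypothetical example with b = 4: three leaves projected to level 2.
X = frozenset({"0110", "0111", "1010"})
W = project(X, 2)
```

With this representation the empty set projects to the empty set, and every nonempty set projects to {ε} (the set containing the empty string) at level 0.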

Let us now study a set family F ⊆ 2^{{0,1}^b}. The intuition here is that
each member of F is a summand, and F represents the sum of its members.
A circuit design must assemble (nucleate) F by taking disjoint unions of
carefully selected subfamilies. This motivates the following definitions.

For a level 0 ≤ ℓ ≤ b and a set of strings W ⊆ {0,1}^ℓ let us define the subfamily
of F that projects to W by

(5)    F_W = \{ X \in F : X|_\ell = W \}.

That is, the family F_W consists of precisely those members X ∈ F that
project to W. Again Fig. 1(b) provides an illustration: we select precisely
those X whose projection is W.

The following technical observations are now immediate. For each 0 ≤
ℓ ≤ b, if ∅ ∈ F, then we have

(6)    F_\emptyset = \{\emptyset\}.

Similarly, for ℓ = 0 we have

(7)    F_{\{\varepsilon\}} = F \setminus \{\emptyset\}.

For ℓ = b we have for every W ∈ F that

(8)    F_W = \{W\}.

Now let us restrict our study to the situation where the family F ⊆ 2^{{0,1}^b}
contains only sets of size at most p. In particular, this is the case in our
applications. For a set U and an integer p, let us write $\binom{U}{p}$ for the family
of all subsets of U of size p, and $\binom{U}{\downarrow p}$ for the family of all subsets of U with
size at most p. Accordingly, for integers 0 ≤ k ≤ n, let us use the shorthand
$\binom{n}{\downarrow k} = \sum_{i=0}^{k} \binom{n}{i}$.

The following lemma enables us to recursively nucleate any family F ⊆
$\binom{\{0,1\}^b}{\downarrow p}$. In particular, we can nucleate the family F_W with W in level ℓ using
the families F_Z with Z in level ℓ + 1. Applied recursively, we obtain F by
proceeding from the bottom up, that is, ℓ = b, b − 1, ..., 1, 0. The intuition
underlying the lemma is illustrated in Fig. 2.

Lemma 1. For all 0 ≤ ℓ ≤ b − 1, F ⊆ $\binom{\{0,1\}^b}{\downarrow p}$, and W ∈ $\binom{\{0,1\}^\ell}{\downarrow p}$, we have
that the family F_W is a disjoint union

F_W = \bigcup \{ F_Z : Z \in \binom{\{0,1\}^{\ell+1}}{\downarrow p}_W \},

where, in the manner of (5), the subscript W selects the sets Z of level ℓ + 1
with Z|_ℓ = W.

Figure 2. Illustrating the proof of Lemma 1. Here b = 5.
The set X (indicated with red nodes) projects to level ℓ = 2
to the set W (indicated with blue nodes) and to level ℓ + 1 = 3
to the set Z (indicated with yellow nodes). Furthermore, the
projection of Z to level ℓ is W. Thus, each X ∈ F is included
in F_W exactly from F_Z in Lemma 1.



Proof. The projection of each X ∈ F to level ℓ + 1 is unique, so the families
F_Z are pairwise disjoint for distinct Z. Now consider an arbitrary X ∈ F
and set X|_{ℓ+1} = Z, that is, X ∈ F_Z. From (4) we have X|_ℓ = Z|_ℓ, which
implies that we have X ∈ F_W if and only if X|_ℓ = W if and only if Z|_ℓ = W
if and only if Z ∈ $\binom{\{0,1\}^{\ell+1}}{\downarrow p}_W$. □
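
Lemma 1 can be checked exhaustively for small parameters. The sketch below is such a sanity check under stated assumptions: strings are Python `str` values, F is taken to be the full family of all sets of at most p leaves, and the helper names (`strings`, `subsets_up_to`, `project`, `check_lemma1`) are hypothetical.

```python
from itertools import combinations, product

def strings(length):
    """All binary strings of the given length."""
    return ["".join(bits) for bits in product("01", repeat=length)]

def subsets_up_to(universe, p):
    """All subsets of `universe` with at most p elements, as frozensets."""
    items = sorted(universe)
    return [frozenset(c) for k in range(p + 1) for c in combinations(items, k)]

def project(X, level):
    """X|_level: the set of length-`level` prefixes of strings in X."""
    return frozenset(x[:level] for x in X)

def check_lemma1(b, p):
    """Verify that F_W is the disjoint union of the F_Z with Z|_level = W."""
    family = subsets_up_to(strings(b), p)   # F: all sets of at most p leaves
    for level in range(b):
        for W in subsets_up_to(strings(level), p):
            F_W = {X for X in family if project(X, level) == W}
            parts = [
                {X for X in family if project(X, level + 1) == Z}
                for Z in subsets_up_to(strings(level + 1), p)
                if project(Z, level) == W
            ]
            covered = set().union(*parts)
            # The parts must cover F_W exactly and must not overlap.
            if covered != F_W or sum(len(part) for part in parts) != len(F_W):
                return False
    return True
```

For example, `check_lemma1(3, 2)` examines every level of a tree with eight leaves.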




2.2. A generalisation: (p, q)-intersection summation. It will be convenient
to study a minor generalisation of (p, q)-disjoint summation. Namely,
instead of insisting on disjointness, we allow nonempty intersections to occur
with "active" (or "avoided") q-subsets A, but require that elements in the
intersection of each p-subset and each A are "individualized." That is, our
input is not given by associating a value f(X) ∈ S to each set X ∈ $\binom{[n]}{\downarrow p}$, but
is instead given by associating a value g(I, X) ∈ S to each pair (I, X) with
I ⊆ X ∈ $\binom{[n]}{\downarrow p}$, where I indicates the elements of X that are "individualized."
In particular, we may insist (by appending to S a formal identity element if
such an element does not already exist in S) that g(I, X) vanishes unless I
is empty. This reduces (p, q)-disjoint summation to the following problem:

Problem 2. ((p, q)-intersection summation) Given as input a function
g that maps each pair (I, X) with I ⊆ X ∈ $\binom{[n]}{\downarrow p}$ and |I| ≤ q to an element
g(I, X) ∈ S, output the function h : $\binom{[n]}{\downarrow q}$ → S defined for all A ∈ $\binom{[n]}{\downarrow q}$ by

(9)    h(A) = \bigoplus_{X \in \binom{[n]}{\downarrow p}} g(A \cap X, X).

2.3. The circuit construction. We proceed to derive a recursion for the
function h using Lemma 1 to carry out nucleation of p-subsets. The recursion
proceeds from the bottom up, that is, ℓ = b, b − 1, ..., 1, 0, in the binary tree
representation. (Recall that we identify the elements of [n] with the elements
of {0,1}^b, where n is a power of 2 with n = 2^b.) The intermediate functions
h_ℓ computed by the recursion are "projections" of (9) using (5). In more
precise terms, for ℓ = b, b − 1, ..., 1, 0, the function
h_ℓ : $\binom{\{0,1\}^b}{\downarrow q} \times \binom{\{0,1\}^\ell}{\downarrow p}$ → S
is defined for all W ∈ $\binom{\{0,1\}^\ell}{\downarrow p}$ and A ∈ $\binom{\{0,1\}^b}{\downarrow q}$ by

(10)   h_\ell(A, W) = \bigoplus_{X \in \binom{\{0,1\}^b}{\downarrow p}_W} g(A \cap X, X).

Let us now observe that we can indeed recover the function h from the case
ℓ = 0. Indeed, for the empty string ε, the empty set ∅ and every A ∈ $\binom{\{0,1\}^b}{\downarrow q}$
we have by (6) and (7) that

(11)   h(A) = h_0(A, \{\varepsilon\}) \oplus h_0(A, \emptyset).

It remains to derive the recursion that gives us h_0. Here we require one
more technical observation, which enables us to narrow down the intermediate
values h_ℓ(A, W) that need to be computed to obtain h_0. In particular, we
may discard the part of the active set A that extends outside the "span" of
W. This observation is the crux in deriving a succinct circuit design.

For 0 ≤ ℓ ≤ b and w ∈ {0,1}^ℓ, we define the span of w by

\langle w \rangle = \{ x \in \{0,1\}^b : \exists z \in \{0,1\}^{b-\ell} \text{ such that } wz = x \}.

In the binary tree, ⟨w⟩ consists of the leaf nodes in the subtree rooted at w.
Let us extend this notation to subsets W ⊆ {0,1}^ℓ by ⟨W⟩ = ∪_{w ∈ W} ⟨w⟩.
The following lemma shows that it is sufficient to evaluate h_ℓ(A, W) only for
W ∈ $\binom{\{0,1\}^\ell}{\downarrow p}$ and A ∈ $\binom{\{0,1\}^b}{\downarrow q}$ such that A ⊆ ⟨W⟩.

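Lemma 3. For all 0 ≤ ℓ ≤ b, W ∈ $\binom{\{0,1\}^\ell}{\downarrow p}$, and A ∈ $\binom{\{0,1\}^b}{\downarrow q}$, we have

(12)   h_\ell(A, W) = h_\ell(A \cap \langle W \rangle, W).

Proof. If X ∈ $\binom{\{0,1\}^b}{\downarrow p}$ and X|_ℓ = W, then we have X ⊆ ⟨W⟩.
Thus, directly from (10), we have

h_\ell(A, W) = \bigoplus_{X \in \binom{\{0,1\}^b}{\downarrow p}_W} g(A \cap X, X)
             = \bigoplus_{X \in \binom{\{0,1\}^b}{\downarrow p}_W} g((A \cap \langle W \rangle) \cap X, X)
             = h_\ell(A \cap \langle W \rangle, W). □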

We are now ready to present the recursion for ℓ = b, b − 1, ..., 1, 0. The
base case ℓ = b is obtained directly based on the values of g, because we
have by (8) for all W ∈ $\binom{\{0,1\}^b}{\downarrow p}$ and A ∈ $\binom{\{0,1\}^b}{\downarrow q}$ with A ⊆ W that

(13)   h_b(A, W) = g(A, W).

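The following lemma gives the recursive step from ℓ + 1 to ℓ by combining
Lemma 1 and Lemma 3.

Lemma 4. For 0 ≤ ℓ ≤ b − 1, W ∈ $\binom{\{0,1\}^\ell}{\downarrow p}$, A ∈ $\binom{\{0,1\}^b}{\downarrow q}$ with A ⊆ ⟨W⟩,
we have

(14)   h_\ell(A, W) = \bigoplus_{Z \in \binom{\{0,1\}^{\ell+1}}{\downarrow p}_W} h_{\ell+1}(A \cap \langle Z \rangle, Z).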
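
Proof. We have

h_\ell(A, W) = \bigoplus_{X \in \binom{\{0,1\}^b}{\downarrow p}_W} g(A \cap X, X)                                             (10)
             = \bigoplus_{Z \in \binom{\{0,1\}^{\ell+1}}{\downarrow p}_W} \bigoplus_{X \in \binom{\{0,1\}^b}{\downarrow p}_Z} g(A \cap X, X)    (Lemma 1)
             = \bigoplus_{Z \in \binom{\{0,1\}^{\ell+1}}{\downarrow p}_W} h_{\ell+1}(A, Z)                                  (10)
             = \bigoplus_{Z \in \binom{\{0,1\}^{\ell+1}}{\downarrow p}_W} h_{\ell+1}(A \cap \langle Z \rangle, Z).          (Lemma 3) □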



The recursion given by (13), (14), and (12) now defines an arithmetic circuit
that solves (p, q)-intersection summation.



2.4. Size of the circuit. By (13), the number of input gates in the circuit
is equal to the number of pairs (I, X) with I ⊆ X ∈ $\binom{\{0,1\}^b}{\downarrow p}$ and |I| ≤ q,
which is

(15)   \sum_{i=0}^{p} \sum_{j=0}^{q} \binom{2^b}{i} \binom{i}{j}.

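To derive an expression for the number of ⊕-gates, we count for each
0 ≤ ℓ ≤ b − 1 the number of pairs (A, W) with W ∈ $\binom{\{0,1\}^\ell}{\downarrow p}$, A ∈ $\binom{\{0,1\}^b}{\downarrow q}$,
and A ⊆ ⟨W⟩, and for each such pair (A, W) we count the number of ⊕-gates
in the subcircuit that computes the value h_ℓ(A, W) from the values of h_{ℓ+1}
using (14).

First, we observe that for each W ∈ $\binom{\{0,1\}^\ell}{\downarrow p}$ we have |⟨W⟩| = 2^{b−ℓ}|W|.
Thus, the number of pairs (A, W) with W ∈ $\binom{\{0,1\}^\ell}{\downarrow p}$, A ∈ $\binom{\{0,1\}^b}{\downarrow q}$, and
A ⊆ ⟨W⟩ is

(16)   \sum_{i=0}^{p} \sum_{j=0}^{q} \binom{2^\ell}{i} \binom{i 2^{b-\ell}}{j}.

For each such pair (A, W), the number of ⊕-gates for (14) is $\left|\binom{\{0,1\}^{\ell+1}}{\downarrow p}_W\right| - 1$.

Lemma 5. For all 0 ≤ ℓ ≤ b − 1, W ∈ $\binom{\{0,1\}^\ell}{\downarrow p}$, and |W| = i, we have

(17)   \left| \binom{\{0,1\}^{\ell+1}}{\downarrow p}_W \right| = \sum_{k=0}^{p-i} \binom{i}{k} 2^{i-k}.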



Proof. A set Z ∈ $\binom{\{0,1\}^{\ell+1}}{\downarrow p}_W$ can contain either one or both of the strings
w0 and w1 for each w ∈ W. The set Z may contain both elements for at
most p − i elements w ∈ W because otherwise |Z| > p. Finally, for each
0 ≤ k ≤ p − i, there are $\binom{i}{k} 2^{i-k}$ ways to select a set Z ∈ $\binom{\{0,1\}^{\ell+1}}{\downarrow p}_W$ such
that Z contains w0 and w1 for exactly k elements w ∈ W.
□
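
The count in Lemma 5 is easy to confirm by enumeration for small cases. The following sketch compares a brute-force count against formula (17); the function names and the string representation are assumptions for illustration, and the summation range is capped at i because $\binom{i}{k}$ vanishes for k > i.

```python
from itertools import combinations
from math import comb

def count_refinements(W, p):
    """Count the sets Z one level below W with |Z| <= p and projection equal to W."""
    children = [w + bit for w in W for bit in "01"]
    count = 0
    for size in range(p + 1):
        for Z in combinations(children, size):
            if {z[:-1] for z in Z} == set(W):   # Z projects exactly onto W
                count += 1
    return count

def lemma5_formula(i, p):
    """Right-hand side of (17) for |W| = i; terms with k > i are zero and skipped."""
    return sum(comb(i, k) * 2 ** (i - k) for k in range(min(i, p - i) + 1))
```

For instance, with W = {00, 01, 10} and p = 4, both counts are 8 (one child per element) plus 12 (both children for exactly one element).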



Finally, for each A ∈ $\binom{\{0,1\}^b}{\downarrow q}$ we require an ⊕-gate that is also designated
as an output gate to implement (11). The number of these gates is

(18)   \sum_{j=0}^{q} \binom{2^b}{j}.



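The total number of gates in the circuit is obtained by combining (15),
(16), (17), and (18). The number of gates is thus

\sum_{i=0}^{p} \sum_{j=0}^{q} \binom{2^b}{i} \binom{i}{j}
  + \sum_{\ell=0}^{b-1} \sum_{i=0}^{p} \sum_{j=0}^{q} \binom{2^\ell}{i} \binom{i 2^{b-\ell}}{j} \Bigl( \sum_{k=0}^{p-i} \binom{i}{k} 2^{i-k} - 1 \Bigr)
  + \sum_{j=0}^{q} \binom{2^b}{j}
\le \sum_{\ell=0}^{b} \sum_{i=0}^{p} \sum_{j=0}^{q} \binom{2^\ell}{i} \binom{i 2^{b-\ell}}{j} 3^i
\le \sum_{\ell=0}^{b} \sum_{i=0}^{p} \sum_{j=0}^{q} \frac{(2^\ell)^{\max(p,q)}}{i!} \cdot \frac{i^j (2^{b-\ell})^{\max(p,q)}}{j!} \cdot 3^i
= n^{\max(p,q)} (1 + \log_2 n) \sum_{i=0}^{p} \sum_{j=0}^{q} \frac{i^j 3^i}{i!\, j!}.

The double sum is at most a constant because we have that

(19)   \sum_{i=0}^{p} \sum_{j=0}^{q} \frac{i^j 3^i}{i!\, j!} \le \sum_{i=0}^{\infty} \frac{3^i}{i!} \sum_{j=0}^{\infty} \frac{i^j}{j!} = \sum_{i=0}^{\infty} \frac{(3e)^i}{i!} \le \sum_{i=0}^{\infty} \frac{(3e^2)^i}{i^i},

where the last inequality follows from Stirling's formula. Furthermore,

(20)   \sum_{i=\lceil 6e^2 \rceil}^{\infty} \frac{(3e^2)^i}{i^i} \le \sum_{i=\lceil 6e^2 \rceil}^{\infty} \frac{1}{2^i} \le \sum_{i=0}^{\infty} \frac{1}{2^i} = 2.

Combining (19) and (20), we have

\sum_{i=0}^{p} \sum_{j=0}^{q} \frac{i^j 3^i}{i!\, j!} \le 2 + \sum_{i=0}^{\lfloor 6e^2 \rfloor} \frac{(3e^2)^i}{i^i}.

Thus, the circuit defined in §2.3 has size O((n^p + n^q) log n), where the constant
hidden by the O-notation does not depend on p and q.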

2.5. Constructing the Circuit. In this section we give an algorithm that 
outputs the circuit presented above, given b, p, and q as input. This algorithm 
can also be used to compute (9) directly without constructing the circuit
first. 

Algorithm 6. Outputs a list of gates in the circuit, with labels on the input
and output gates.

1. Initialise an associative data structure D.

2. For each X ∈ $\binom{\{0,1\}^b}{\downarrow p}$ and I ∈ $\binom{X}{\downarrow q}$, create an input gate g labelled
with (I, X) and set D(I, X) ← g.

3. Set ℓ ← b − 1.

4. For each W ∈ $\binom{\{0,1\}^\ell}{\downarrow p}$ and A ∈ $\binom{\langle W \rangle}{\downarrow q}$,
4.1. select an arbitrary Z_0 ∈ $\binom{\{0,1\}^{\ell+1}}{\downarrow p}_W$ and set g ← D(A ∩ ⟨Z_0⟩, Z_0),
4.2. for each Z ∈ $\binom{\{0,1\}^{\ell+1}}{\downarrow p}_W$ \ {Z_0} set g ← g ⊕ D(A ∩ ⟨Z⟩, Z), and
4.3. set D(A, W) ← g.

5. If ℓ ≥ 1, set ℓ ← ℓ − 1 and go to Step 4.

6. For each A ∈ $\binom{\{0,1\}^b}{\downarrow q}$, create an output gate D(A, {ε}) ⊕ D(A, ∅)
labelled with A.

Letting k = max(p, q), the sets that appear in Algorithm 6 can be represented
in O(k log n) space, and each required operation on these sets can be
done in O(k log n) time. Thus, we observe that iterating over set families
in Algorithm 6 takes O(k log n) time per element, and the total number
of iterations the algorithm makes is the same as the number of gates in the
circuit. Also, assuming that D is a self-balancing binary tree, each search
and insert operation takes O((k log n)^2) time, and a constant number of
these operations is required for each gate. Thus, we have that the total
running time of Algorithm 6 is O((p^2 + q^2)(n^p + n^q) log^3 n).
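
The recursion (13), (14), (11) can also be evaluated directly on values instead of emitting gates. The sketch below does this under several assumptions that are not part of the original: binary strings are Python `str` values, the input `g` is a dictionary that is defined on every pair (I, X) with I ⊆ X, |X| ≤ p and |I| ≤ q (including the pair of empty sets), a plain `dict` stands in for the self-balancing tree D, and all function names are hypothetical. Following Lemma 3, the lookup for W = ∅ in the last step uses A ∩ ⟨∅⟩ = ∅.

```python
from itertools import combinations, product

def strings(length):
    """All binary strings of the given length."""
    return ["".join(bits) for bits in product("01", repeat=length)]

def subsets_up_to(universe, k):
    """All subsets of `universe` with at most k elements, as frozensets."""
    items = sorted(universe)
    return [frozenset(c) for r in range(k + 1) for c in combinations(items, r)]

def restrict_to_span(A, Z, level):
    """A ∩ <Z>: the strings of A whose length-`level` prefix lies in Z."""
    return frozenset(a for a in A if a[:level] in Z)

def intersection_summation(b, p, q, g, op):
    """Return (h, gates): h[A] as in (9), and the number of applications of op."""
    gates = 0
    table = dict(g)                                  # base case (13), level b
    for level in range(b - 1, -1, -1):
        # Group the sets Z of level + 1 by their projection W to `level`.
        groups = {}
        for Z in subsets_up_to(strings(level + 1), p):
            W = frozenset(z[:level] for z in Z)
            groups.setdefault(W, []).append(Z)
        new_table = {}
        for W, refinements in groups.items():
            span = [w + "".join(t) for w in W
                    for t in product("01", repeat=b - level)]
            for A in subsets_up_to(span, q):         # only A inside <W> (Lemma 3)
                terms = [table[(restrict_to_span(A, Z, level + 1), Z)]
                         for Z in refinements]
                value = terms[0]
                for term in terms[1:]:               # recursive step (14)
                    value = op(value, term)
                    gates += 1
                new_table[(A, W)] = value
        table = new_table
    root, empty = frozenset({""}), frozenset()
    h = {}
    for A in subsets_up_to(strings(b), q):           # output step (11)
        h[A] = op(table[(A, root)], table[(empty, empty)])
        gates += 1
    return h, gates
```

A convenient way to exercise the sketch is to fill `g` with arbitrary integers, use integer addition (or `max`) as `op`, and compare each `h[A]` against the defining sum (9) computed naively.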

3. Concluding Remarks and Applications 



We have generalised Valiant's [23] observation that negation is powerless
for computing simultaneously the n different disjunctions of all but one of
the given n variables: now we know that, in our terminology, subtraction is
powerless for (p, q)-disjoint summation for any constant p and q. (Valiant
proved this for p = q = 1.) Interestingly, requiring p and q to be constants
turns out to be essential: when subtraction is available, an inclusion-exclusion
technique is known [5] to yield a circuit of size $O(p \binom{n}{\downarrow p} + q \binom{n}{\downarrow q})$,
which, in terms of p and q, is exponentially smaller than our bound O((n^p +
n^q) log n). This gap highlights the difference of the algorithmic ideas behind
the two results. Whether the gap can be improved to polynomial in p and q
is an open question.

While we have dealt with the abstract notions of "monotone sums" or
semigroup sums, in applications they most often materialise as maximisation
or minimisation, as described in the next paragraphs. Also, in applications
local terms are usually combined not only by one (monotone) operation but
two different operations, such as "min" and "+". To facilitate the treatment
of such applications, we extend the semigroup to a semiring (S, ⊕, ⊙) by
introducing a product operation "⊙". Now the task is to evaluate

(21)   \bigoplus_{X, Y : X \cap Y = \emptyset} f(X) \odot g(Y),

where X and Y run through all p-subsets and q-subsets of [n], respectively,
and f and g are given mappings to S. We immediately observe that the
expression (21) is equal to $\bigoplus_Y e(Y) \odot g(Y)$, where the sum is over all q-subsets
Y of [n] and e is as in (1). Thus, by our main result, it can be evaluated
using a circuit with O((n^p + n^q) log n) gates.
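
The identity between (21) and the sum of e(Y) ⊙ g(Y) rests on distributivity and can be illustrated numerically. The sketch below uses the (min, +) semiring on small integers; the function names, the example functions f and g, and the naive evaluation of e(Y) are assumptions chosen for illustration, so the code shows the regrouping only and not the fast circuit. It assumes p + q ≤ n so that every sum is nonempty.

```python
from functools import reduce
from itertools import combinations

def k_subsets(n, k):
    """All k-subsets of {1, ..., n} as frozensets."""
    return [frozenset(c) for c in combinations(range(1, n + 1), k)]

def pair_sum_direct(n, p, q, f, g, add, mul):
    """Evaluate (21) term by term over all disjoint pairs (X, Y)."""
    terms = [mul(f[X], g[Y])
             for X in k_subsets(n, p) for Y in k_subsets(n, q)
             if X.isdisjoint(Y)]
    return reduce(add, terms)

def pair_sum_via_e(n, p, q, f, g, add, mul):
    """Evaluate (21) as the sum over Y of e(Y) * g(Y), with e as in (1)."""
    terms = []
    for Y in k_subsets(n, q):
        e_Y = reduce(add, [f[X] for X in k_subsets(n, p) if X.isdisjoint(Y)])
        terms.append(mul(e_Y, g[Y]))
    return reduce(add, terms)

# Hypothetical example in the (min, +) semiring with n = 5 and p = q = 2.
f = {X: sum(X) for X in k_subsets(5, 2)}
g = {Y: max(Y) for Y in k_subsets(5, 2)}
plus = lambda x, y: x + y
direct = pair_sum_direct(5, 2, 2, f, g, min, plus)
grouped = pair_sum_via_e(5, 2, 2, f, g, min, plus)
```

In the second form, the only part whose cost depends on pairs (X, Y) is the computation of e, which is exactly what the main result accelerates.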

3.1. Application to k-paths. We apply the semiring formulation to the
problem of counting the maximum-weight k-edge paths from vertex s to vertex
t in a given edge-weighted graph with real weights, where we assume that
we are only allowed to add and compare real numbers and these operations
take constant time (cf. [25]). By straightforward Bellman-Held-Karp type
dynamic programming [2J El HE] (or, even by brute force) we can solve the
problem in $\binom{n}{k} n^{O(1)}$ time. However, our main result gives an algorithm that
runs in n^{k/2+O(1)} time by solving the problem in halves: Guess a middle
vertex v and define f_1(X) as the number of maximum-weight k/2-edge paths
from s to v in the graph induced by the vertex set X ∪ {v}; similarly define
g_1(X) for the k/2-edge paths from v to t. Furthermore, define f_2(X) and
g_2(X) as the respective maximum weights and put f(X) = (f_1(X), f_2(X))
and g(X) = (g_1(X), g_2(X)). These values can be computed for all vertex
subsets X of size k/2 in $\binom{n}{k/2} n^{O(1)}$ time. It remains to define the semiring
operations in such a way that the expression (21) equals the desired number
of k-edge paths; one can verify that the following definitions work correctly:
(c, w) ⊙ (c', w') = (c · c', w + w') and

(c, w) \oplus (c', w') = \begin{cases} (c, w) & \text{if } w > w', \\ (c', w') & \text{if } w < w', \\ (c + c', w) & \text{if } w = w'. \end{cases}

See Appendix B for details.
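
The two operations on (count, weight) pairs can be written down in a few lines. This is a minimal sketch with hypothetical function names `cw_add` and `cw_mul` and made-up example values; pairs are plain Python tuples and weights are assumed to be finite numbers.

```python
def cw_add(a, b):
    """(c, w) ⊕ (c', w'): keep the pair of larger weight; add the counts on a tie."""
    (c, w), (d, v) = a, b
    if w > v:
        return (c, w)
    if w < v:
        return (d, v)
    return (c + d, w)

def cw_mul(a, b):
    """(c, w) ⊙ (c', w') = (c * c', w + w'): concatenate path halves."""
    (c, w), (d, v) = a, b
    return (c * d, w + v)

# Hypothetical example: three heaviest first halves of weight 5.0,
# and a single heaviest second half of weight 3.0.
first_half = cw_add((1, 5.0), (2, 5.0))
second_half = cw_add((1, 2.0), (1, 3.0))
total = cw_mul(first_half, second_half)
```

Here `total` records that three full paths attain the maximum combined weight.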

Thus, the techniques of the present paper enable solving the problem
essentially as fast as the fastest known algorithms for the special case of
counting all the k-paths, for which quite different techniques relying on
subtraction yield a $\binom{n}{k/2} n^{O(1)}$ time bound [7]. On the other hand, for the more
general problem of counting weighted subgraphs Vassilevska and Williams
[24] give an algorithm whose running time, when applied to k-paths, is
O(n^{ωk/3} + n^{2k/3+c}), where ω < 2.3727 is the exponent of matrix multiplication
and c is a constant; this of course would remain worse than our bound even
if ω = 2.



3.2. Application to matrix permanent. Consider the problem of computing
the permanent of a k × n matrix (a_{ij}) over a noncommutative semiring,
with k ≤ n and k even for simplicity, given by $\sum_\sigma a_{1\sigma(1)} a_{2\sigma(2)} \cdots a_{k\sigma(k)}$,
where the sum is over all injective mappings σ from [k] to [n]. We observe
that the expression (21) equals the permanent if we let p = q =
k/2 = ℓ and define f(X) as the sum of $a_{1\sigma(1)} a_{2\sigma(2)} \cdots a_{\ell\sigma(\ell)}$ over all injective
mappings σ from {1, 2, ..., ℓ} to X and, similarly, g(Y) as the
sum of $a_{\ell+1,\sigma(\ell+1)} a_{\ell+2,\sigma(\ell+2)} \cdots a_{k\sigma(k)}$ over all injective mappings σ from
{ℓ + 1, ℓ + 2, ..., k} to Y. Since the values f(X) and g(Y) for all relevant
X and Y can be computed by dynamic programming in $\binom{n}{k/2} n^{O(1)}$
time, our main result yields the time bound n^{k/2+O(1)} for computing the
permanent.

Thus we improve significantly upon a Bellman-Held-Karp type dynamic
programming algorithm that computes the permanent in $\binom{n}{k} n^{O(1)}$ time, the
best previous upper bound we are aware of for noncommutative semirings
|S]. It should be noted, however, that essentially as fast algorithms are
already known for noncommutative rings 0, and that faster, 2^k n^{O(1)} time,
algorithms are known for commutative semirings [HI [21].




3.3. Application to feature selection. The extensively studied feature 
selection problem in machine learning asks for a subset X of a given set of 
available features A so as to maximise some objective function f(X). Often 
the size of X can be bounded from above by some constant k, and sometimes 
the selection task needs to be solved repeatedly with the set of available 
features A changing dynamically across, say, the set [n] of all features. Such 
constraints take place in a recent work [10] on Bayesian network structure
learning by branch and bound: the algorithm proceeds by forcing some
features, I, to be included in X and some other, E, to be excluded from
X. Thus the key computational step becomes that of maximising f(X)
subject to I ⊆ X ⊆ [n] \ E and |X| ≤ k, which is repeated for varying I and
E. We observe that instead of computing the maximum every time from
scratch, it pays off to precompute a solution to (p, q)-disjoint summation for
all 0 ≤ p, q ≤ k, since this takes about the same time as a single step for
I = ∅ and any fixed E. Indeed, in the scenario where the branch and bound
search proceeds to exclude each and every subset of k features in turn, but
no larger subsets, such precomputation decreases the running time bound
quite dramatically, from O(n^{2k}) to O(n^k); typically, n ranges from tens to
some hundreds and k from 2 to 7. Admittedly, in practice, one can expect the
search procedure to match the said scenario only partially, and so the savings
will be more modest yet significant.

Acknowledgement. We thank Jukka Suomela for useful discussions. 

APPENDIX

Appendix A. Valiant's Construction and Generalisations 

This section reviews Valiant's [23] circuit construction for (p, q)-disjoint
summation in the case p = q = 1. We also present minor generalisations
to the case when either p = 1 or q = 1. For ease of exposition we follow
the conventions and notation from Sect. 2. Accordingly, we assume that
n = 2^b for a nonnegative integer b and identify the elements of [n] with
binary strings in {0,1}^b.

A.1. Valiant's construction. For p = q = 1, the disjoint summation
problem reduces to the following form: given f : {0,1}^b → S as input,
compute e : {0,1}^b → S defined for all y ∈ {0,1}^b by

(22)   e(y) = \bigoplus_{x \in \{0,1\}^b : x \neq y} f(x).

Valiant's construction computes these sums by first computing intermediate
functions h_ℓ^+ : {0,1}^ℓ → S defined for ℓ = 1, 2, ..., b and w ∈ {0,1}^ℓ by

(23)   h_\ell^+(w) = \bigoplus_{x \in \langle w \rangle} f(x),

and h_ℓ^− : {0,1}^ℓ → S defined for ℓ = 1, 2, ..., b and u ∈ {0,1}^ℓ by

(24)   h_\ell^-(u) = \bigoplus_{x \in \{0,1\}^b \setminus \langle u \rangle} f(x).

The solution to (22) can then be recovered as e(y) = h_b^−(y) for all y ∈ {0,1}^b.

The values (23) can be computed for ℓ = b, b − 1, ..., 2, 1 using the
recurrence

(25)   h_b^+(x) = f(x),
       h_\ell^+(w) = \bigoplus_{i \in \{0,1\}} h_{\ell+1}^+(wi).

Assuming that the functions h_ℓ^+ have been computed for all ℓ, the values (24)
can then be computed for ℓ = 1, 2, ..., b as

(26)   h_1^-(u) = h_1^+(1 - u),
       h_\ell^-(ui) = h_{\ell-1}^-(u) \oplus h_\ell^+(u(1 - i)).

The number of ⊕-gates to implement (25) and (26) as a circuit is exactly

\sum_{\ell=1}^{b-1} 2^\ell + \sum_{\ell=2}^{b} 2^\ell = 3n - 6 = O(n).
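
The recurrences (25) and (26) translate into a short program. The sketch below is one possible rendering under stated assumptions: nodes of level ℓ are indexed by the integer whose ℓ-bit binary expansion is the node's string, the input is a Python list of length n = 2^b with b ≥ 1, and the function name `all_but_one_sums` is hypothetical. It also counts applications of the operation so that the 3n − 6 bound can be observed.

```python
def all_but_one_sums(f, op):
    """Subtraction-free complementary sums, following recurrences (25) and (26).

    f  : list of n = 2**b values (b >= 1); f[x] is the value at leaf x
    op : the associative, commutative semigroup operation
    Returns (e, gates): e[y] is the op-sum of all f[x] with x != y, and
    gates is the number of times op was applied.
    """
    n = len(f)
    b = n.bit_length() - 1
    assert b >= 1 and n == 1 << b, "n must be a power of two, at least 2"
    gates = 0
    # up[l][w] = h^+_l(w): the sum over the leaves below node w of level l.
    up = {b: list(f)}
    for level in range(b - 1, 0, -1):
        child = up[level + 1]
        up[level] = [op(child[2 * w], child[2 * w + 1]) for w in range(1 << level)]
        gates += 1 << level
    # down[u] = h^-_l(u): the sum over the leaves NOT below node u of level l.
    down = [up[1][1], up[1][0]]             # level 1: the sibling subtree
    for level in range(2, b + 1):
        nxt = []
        for ui in range(1 << level):
            u, sibling = ui >> 1, ui ^ 1    # parent u and the sibling node u(1 - i)
            nxt.append(op(down[u], up[level][sibling]))
        gates += 1 << level
        down = nxt
    return down, gates
```

For n = 8 the function applies the operation 18 times, in agreement with 3n − 6; with `op = max` it returns, for each position, the maximum of all other entries.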



A.2. Generalisation for p = 1 and q > 1. For p = 1 and q > 1, the
(p, q)-disjoint summation problem reduces to computing

(27)   e(Y) = \bigoplus_{x : x \notin Y} f(x),

where x ∈ {0,1}^b and Y ∈ $\binom{\{0,1\}^b}{\downarrow q}$.

To evaluate (27), we proceed analogously to the p = q = 1 case, first
computing functions h_ℓ^+ : {0,1}^ℓ → S as defined in (23). The second set of
intermediate functions now consists of functions h_ℓ^− : $\binom{\{0,1\}^\ell}{\downarrow q}$ → S, defined
for ℓ = 0, 1, ..., b and U ∈ $\binom{\{0,1\}^\ell}{\downarrow q}$ by

(28)   h_\ell^-(U) = \bigoplus_{x \in \{0,1\}^b \setminus \langle U \rangle} f(x).

Then we have e(Y) = h_b^−(Y) for all Y ∈ $\binom{\{0,1\}^b}{\downarrow q}$.

The values h_ℓ^+ are computed as in Valiant's construction. For ℓ = 0, 1, ..., b
and U ∈ $\binom{\{0,1\}^\ell}{\downarrow q}$, define

\bar{U} = \{ x(1 - i) : xi \in U \text{ and } x(1 - i) \notin U \}.

Now we can evaluate (28) using the recurrence

(29)   h_0^-(\{\varepsilon\}) = 0,
       h_0^-(\emptyset) = h_1^+(0) \oplus h_1^+(1),
       h_\ell^-(U) = h_{\ell-1}^-(U|_{\ell-1}) \oplus \bigoplus_{x \in \bar{U}} h_\ell^+(x).



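The number of ⊕-gates to evaluate the values h_ℓ^+ is n − 2. To determine
the number of ⊕-gates to evaluate the values h_ℓ^−, we note that |Ū| ≤ |U|,
and thus at most |U| ≤ q gates are used for any U ∈ $\binom{\{0,1\}^\ell}{\downarrow q}$. Thus, the total number
of ⊕-gates is at most

n - 1 + \sum_{\ell=1}^{b} \sum_{U \in \binom{\{0,1\}^\ell}{\downarrow q}} |U| = n - 1 + \sum_{i=0}^{q} \sum_{\ell=1}^{b} i \binom{2^\ell}{i}.

For positive integers i and ℓ we have

\binom{2^{\ell-1}}{i} \le 2^{-i} \binom{2^\ell}{i} \le \frac{1}{2} \binom{2^\ell}{i}.

Thus for positive integers i it holds that

\sum_{\ell=1}^{b} \binom{2^\ell}{i} \le \binom{2^b}{i} \sum_{\ell=0}^{b-1} 2^{-\ell} < 2 \binom{2^b}{i},

and hence

\sum_{i=0}^{q} \sum_{\ell=1}^{b} i \binom{2^\ell}{i} \le 2 \sum_{i=1}^{q} i \binom{2^b}{i}.

That is, the total number of ⊕-gates used is $O(q \binom{n}{\downarrow q})$.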



A.3. Generalisation for p > 1 and q = 1. For p > 1 and q = 1, the
(p, q)-disjoint summation problem reduces to computing

(30)   e(y) = \bigoplus_{X : y \notin X} f(X),

where X ∈ $\binom{\{0,1\}^b}{p}$ and y ∈ {0,1}^b.

Now the first intermediate functions nucleate the inputs using tree-projection,
that is, we define h_ℓ^+ : $\binom{\{0,1\}^\ell}{\downarrow p}$ → S for ℓ = 0, 1, ..., b and W ∈ $\binom{\{0,1\}^\ell}{\downarrow p}$ by

(31)   h_\ell^+(W) = \bigoplus_{X \in \binom{\{0,1\}^b}{p}_W} f(X).

We define the second intermediate functions h_ℓ^− : {0,1}^ℓ → S for all ℓ =
0, 1, ..., b and u ∈ {0,1}^ℓ by

(32)   h_\ell^-(u) = \bigoplus_{W : u \notin W} h_\ell^+(W),

where W ranges over $\binom{\{0,1\}^\ell}{\downarrow p}$. Again e(y) = h_b^−(y) for all y ∈ {0,1}^b.

By (8) and Lemma 1 we can compute (31) by the recurrence

h_b^+(X) = f(X),
h_\ell^+(W) = \bigoplus_{Z \in \binom{\{0,1\}^{\ell+1}}{\downarrow p}_W} h_{\ell+1}^+(Z).

Similarly, we can compute (32) by the recurrence

h_\ell^-(xi) = h_{\ell-1}^-(x) \oplus \bigoplus_{W} h_\ell^+(\{x(1 - i)\} \cup W),

where W ranges over the subsets of {0,1}^ℓ \ {x0, x1} of size at most p − 1.

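The number of ⊕-gates to evaluate h_ℓ^+ is $\binom{n}{p} - 1$, and to evaluate h_ℓ^− at
most

\sum_{\ell=1}^{b} 2^\ell \binom{2^\ell - 2}{\downarrow (p-1)}
  \le \sum_{\ell=1}^{b} \sum_{i=0}^{p-1} 2^\ell \binom{2^\ell - 1}{i}
  = \sum_{\ell=1}^{b} \sum_{i=0}^{p-1} (i+1) \binom{2^\ell}{i+1}
  \le 2 \sum_{i=1}^{p} i \binom{2^b}{i}.

Thus, the total number of ⊕-gates used is $O(p \binom{n}{\downarrow p})$.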

A.4. Generalisation for p > 1 and q > 1. It is an open problem whether
a circuit of size $O(p \binom{n}{\downarrow p} + q \binom{n}{\downarrow q})$ exists when p, q > 1. Our main result in
Sect. 2 gives a circuit of size O((n^p + n^q) log n).

Appendix B. Count-weight semirings 

This section reviews the count-weight semiring used in counting heaviest
k-paths as described in §3.1. The following theorem is quite standard, but
we give a detailed proof for the sake of completeness.

Theorem 7. The Cartesian product ℕ × (ℝ ∪ {−∞}) is a commutative
semiring when equipped with operations

(33)   (c, w) \oplus (d, v) = \begin{cases} (c, w) & \text{if } w > v, \\ (d, v) & \text{if } w < v, \\ (c + d, w) & \text{if } w = v \end{cases}

and

(34)   (c, w) \odot (d, v) = (cd, w + v).

Proof. In the following, we will use the Iverson bracket notation; that is, if
P is a predicate we have [P] = 1 if P is true and [P] = 0 if P is false. For
the maximum of two elements x, y ∈ ℝ, we write x ∨ y = max(x, y). We note
that ∨ is an associative, commutative binary operation on ℝ ∪ {−∞}.
First, let us note that it follows directly from (33) that

(c, w) ⊕ (d, v) = (c [w = w ∨ v] + d [v = w ∨ v], w ∨ v).

We now prove the claim using this observation and the known properties of
+ and ∨.

Associativity of ⊕. The associativity of ⊕ follows from the associativity of +
and ∨, as we have

(c_1, w_1) ⊕ ((c_2, w_2) ⊕ (c_3, w_3))
  = (c_1, w_1) ⊕ (c_2 [w_2 = w_2 ∨ w_3] + c_3 [w_3 = w_2 ∨ w_3], w_2 ∨ w_3)
  = (\sum_{i=1}^{3} c_i [w_i = w_1 ∨ (w_2 ∨ w_3)], w_1 ∨ (w_2 ∨ w_3))
  = (\sum_{i=1}^{3} c_i [w_i = (w_1 ∨ w_2) ∨ w_3], (w_1 ∨ w_2) ∨ w_3)
  = (c_1 [w_1 = w_1 ∨ w_2] + c_2 [w_2 = w_1 ∨ w_2], w_1 ∨ w_2) ⊕ (c_3, w_3)
  = ((c_1, w_1) ⊕ (c_2, w_2)) ⊕ (c_3, w_3).

Commutativity of ⊕. The commutativity of ⊕ follows from the commutativity
of + and ∨, as we have

(c, w) ⊕ (d, v) = (c [w = w ∨ v] + d [v = w ∨ v], w ∨ v)
                = (d [v = v ∨ w] + c [w = v ∨ w], v ∨ w)
                = (d, v) ⊕ (c, w).

Existence of additive identity. We have that (0, −∞) is the identity element
for ⊕, as we have

(c, w) ⊕ (0, −∞) = (c [w = w ∨ −∞] + 0 [−∞ = w ∨ −∞], w ∨ −∞)
                 = (c, w).

Associativity of ⊙. We have

(c, w) ⊙ ((d, v) ⊙ (e, u))
  = (c, w) ⊙ (de, v + u)
  = (c(de), w + (v + u))
  = ((cd)e, (w + v) + u)
  = (cd, w + v) ⊙ (e, u)
  = ((c, w) ⊙ (d, v)) ⊙ (e, u).

Commutativity of ⊙. We have

(c, w) ⊙ (d, v) = (cd, w + v) = (dc, v + w) = (d, v) ⊙ (c, w).

Existence of multiplicative identity. The multiplicative identity element is
(1, 0), since

(c, w) ⊙ (1, 0) = (c · 1, w + 0) = (c, w).
Distributivity. As ⊙ is commutative, it suffices to prove that multiplication
from left distributes over addition. We have

(c, w) ⊙ ((d, v) ⊕ (e, u))
  = (c, w) ⊙ (d [v = v ∨ u] + e [u = v ∨ u], v ∨ u)
  = (c (d [v = v ∨ u] + e [u = v ∨ u]), w + (v ∨ u))
  = (cd [v = v ∨ u] + ce [u = v ∨ u], (w + v) ∨ (w + u))
  = (cd [w + v = (w + v) ∨ (w + u)] + ce [w + u = (w + v) ∨ (w + u)],
     (w + v) ∨ (w + u))
  = (cd, w + v) ⊕ (ce, w + u)
  = ((c, w) ⊙ (d, v)) ⊕ ((c, w) ⊙ (e, u)).

Annihilation in multiplication. Finally, we have that the additive identity
element annihilates in multiplication, that is,

(c, w) ⊙ (0, −∞) = (c · 0, w + (−∞)) = (0, −∞).

□ 